Abstract: The recently proposed covariance region descriptor has been proven robust and versatile for a modest computational
cost.  The  covariance  matrix  enables  efficient  fusion  of  different  types  of  features,  where  the  spatial  and  statistical  properties  as  well 
as  their  correlation  are  characterized.  The  similarity  between  two  covariance  descriptors  is  measured  on  Riemannian  manifolds. 
Based  on  the  same  metric,  but  with  a  probabilistic  framework,  we  propose  a  novel  tracking  approach  on  Riemannian  manifolds 
with  a  novel  incremental  covariance  tensor  learning  (ICTL).  To  address  the  appearance  variations,  ICTL  incrementally  learns  a 
low-dimensional covariance tensor representation and efficiently adapts online to appearance changes of the target with only O(1)
computational  complexity,  resulting  in  a  real-time  performance.  The  covariance-based  representation  and  ICTL  are  then  combined 
with  the  particle  filter  framework  to  allow  better  handling  of  background  clutter  as  well  as  the  temporary  occlusions.  We  test  the 
proposed  probabilistic  ICTL  tracker  on  numerous  benchmark  sequences  involving  different  types  of  challenges  including  occlusions 
and  variations  in  illumination,  scale,  and  pose.  The  proposed  approach  demonstrates  excellent  real-time  performance,  both  qualitatively 
and  quantitatively,  in  comparison  with  several  previously  proposed  trackers. 

Index  Terms — Visual  tracking,  particle  filter,  covariance  descriptor,  Riemannian  manifolds,  incremental  learning,  model  update. 
1  Introduction

Visual tracking is a challenging problem, which can be attributed
to the difficulty in handling the appearance variability
of a target. In general, appearance variations can be divided
into two types: intrinsic and extrinsic. The intrinsic appearance
variations include pose change and shape deformation,
whereas the extrinsic variations include changes in illumination
and camera viewpoint, and occlusions. Consequently,
it is imperative for a robust tracking algorithm to model
such appearance variations to ensure real-time and accurate
performance.

Appearance  models  in  visual  tracking  approaches  are  often 
sensitive  to  the  variations  in  illumination,  view,  and  pose. 
Such  sensitivity  results  from  a  lack  of  a  competent  object 
description  criterion  that  captures  both  statistical  and  spatial 
properties of the object appearance. Recently, the covariance
region descriptor (CRD) was proposed in [39] to address these
sensitivities by capturing the correlations among extracted
features inside an object region.

Using the CRD as the appearance model, we propose a
novel probabilistic tracking approach via Incremental Covariance
Tensor Learning (ICTL). In contrast to the covariance
tracking algorithm [33], we use tensor analysis to simplify
the complex model update process on the Riemannian manifold
by computing the weighted sample covariance, which can
be updated incrementally during the object tracking process.
Thus our appearance model can update more efficiently, adapt
to extrinsic variations, and afford object identification with
intrinsic variations, which is the main contribution of our
work. Further, our ICTL method uses a particle filter [13]
for motion parameter estimation rather than the exhaustive
search-based method [33], which is very time-consuming and
often distracted by outliers. Moreover, the integral image data
structure [32] is adopted to accelerate the tracker.

In  summary,  our  proposed  tracking  framework  includes 
two  stages:  (a)  probabilistic  Bayesian  inference  for  covariance 
tracking;  and  (b)  incremental  covariance  tensor  learning  for 
model  update.  In  the  first  stage,  the  object  state  is  obtained  by 
a maximum a posteriori (MAP) estimation within the Bayesian
state  inference  framework  in  which  a  particle  filter  is  applied  to 
propagate  sample  distributions  over  time.  In  the  second  stage, 
a  low  dimensional  covariance  model  is  learned  online.  The 
model  uses  the  proposed  ICTL  algorithm  to  find  the  compact 
covariance  representation  in  the  multi-modes.  After  the  MAP 
estimation  of  the  Bayesian  inference,  we  use  the  covariance 
matrices  of  image  features  associated  with  the  estimated  target 
state  to  update  the  compact  covariance  tensor  model  for  each 
mode. The two-stage architecture is executed repeatedly as
time progresses, as shown in Fig. 1. Moreover, with the use
of tensors of integral images, our tracker achieves real-time
performance.

Fig. 1. Overview of the proposed tracking approach.

2  Related  work 

There  is  a  rich  literature  in  visual  tracking  and  a  thorough 
discussion  on  this  topic  is  beyond  the  scope  of  this  paper. 
There are many uses of covariance information in target tracking,
such as covariance intersection for measurement-based
tracking  from  multiple  sensors  [14],  covariance  control  for 
sensor  scheduling  and  management  [16],  [41],  etc.  Given  the 
widespread  use  of  covariance  analysis  in  target  tracking,  in  this 
section  we  review  only  the  most  relevant  visual  tracking  work 
that  motivated  our  approach,  focusing  on  target  representation 
and  model  update. 

2.1  Target  representation 

Target representation is one of the major components in typical
visual trackers, and extensive studies have been presented.
Histograms prove to be a powerful representation for an image
region. Discarding the spatial information, the color histogram
is robust to changes of object pose and shape. Several
successful tracking approaches utilize color histograms [8],
[26]. Birchfield and Rangarajan [6] proposed a novel histogram,
named the spatiogram, to capture not only the values of the
pixels but their spatial relationships as well. To calculate the
histogram efficiently, Porikli et al. [32] proposed a fast way
to extract histograms called the integral histogram. Recently,
sparse representation has been introduced for visual tracking
via $\ell_1$-minimization [23] and has been further extended
in [25], [44], [17], [22], [42], [21].

The  covariance  region  descriptor  (CRD)  proposed  in  [39] 
has  been  proved  to  be  robust  and  versatile  for  a  modest 
computational  cost.  The  CRD  has  been  applied  to  many 
computer  vision  tasks,  such  as  object  classification  [12],  [38], 
[36],  human  detection  [40],  [28],  face  recognition  [29],  action 
recognition  [10]  and  tracking  [33],  [46],  [45],  [43].  The 
covariance  matrix  enables  efficient  fusion  of  different  types 
of  features  and  its  dimensionality  is  small.  An  object  window 
is  represented  as  the  covariance  matrix  of  features,  where 
the  spatial  and  statistical  feature  properties  as  well  as  their 
correlations  are  characterized  within  the  same  representation. 
The  similarity  of  two  covariance  descriptors  is  measured  on 
Riemannian  manifolds  which  we  call  the  Manifold  Covariance 
Similarity (MCS) metric. Porikli et al. [33] generalized the
covariance descriptor to a tracking problem by exhaustively
searching the whole image for the region that best matches
the model descriptor (i.e., maximum likelihood estimation,
MLE). Using the MLE covariance descriptor is time-consuming,
computationally inefficient, easily affected by background
clutter, and ineffective under occlusions.

Improvement  for  such  situations  is  one  of  the  benefits  of 
our  proposed  probabilistic  ICTL  tracking  approach.  Relying  on 
the  same  MCS  metric  to  compare  two  covariance  descriptors, 
we  embed  it  within  a  sequential  Monte  Carlo  framework.  To 
utilize  the  MCS  requires  building  Riemannian  manifold  local 
likelihoods,  coupling  the  manifold  observation  model  with  a 
dynamical  state  space  model,  and  sequentially  approximating 
the  posterior  distribution  with  a  particle  filter.  Using  the 
sample-based  filtering  technique  enables  tracking  multiple 
posterior  modes,  which  is  the  key  to  mitigate  background 
distractions  and  to  recover  after  temporary  occlusions. 

2.2  Appearance  variations  modeling 

To  model  the  appearance  variations  of  a  target,  there  have  been 
many  visual  tracking  approaches  reported  in  the  last  decades. 
Zhou  et  al.  [48]  embedded  appearance  adaptive  models  into 
a  particle  filter  to  achieve  a  robust  visual  tracking.  In  [34], 
Ross  et  al.  proposed  a  generalized  visual  tracking  framework 
based  on  the  incremental  image-as-vector  subspace  learning 
methods  with  a  sample  mean  update.  The  sparse  representation 
of  target  [24],  [25]  is  updated  by  introducing  importance 
weights  for  the  templates  and  identifying  rarely  used  templates 
for  replacement.  To  handle  appearance  changes,  SVT  [3] 
integrates  an  offline  trained  support  vector  machine  (SVM) 
classifier  into  an  optic-flow-based  tracker.  In  [7],  the  most 
discriminative  RGB  color  combination  is  learned  online  to 
build  a  confidence  map  in  each  frame.  In  [4],  an  ensemble 
of  online  learned  weak  classifiers  is  used  to  label  a  pixel  as 
belonging  to  either  the  object  or  the  background.  To  encode 
the  object  appearance  variations,  Yu  et  al.  [47]  proposed  to  use 
co-training  to  combine  generative  and  discriminative  models 
to  learn  an  appearance  model  on-the-fly.  In  [15],  Kalal  et 
al.  proposed  a  learning  process  guided  by  positive  and  negative 
constraints  to  distinguish  the  target  from  background. 

For  visual  target  tracking  with  a  changing  appearance,  it 
is  likely  that  recent  observations  will  be  more  indicative  of 
its  appearance  than  more  distant  ones.  One  way  to  balance 
old  and  new  observations  is  to  allow  newer  images  to  have 
a  larger  influence  on  the  estimation  of  the  current  appearance 
model  than  the  older  ones.  To  do  this,  a  forgetting  factor  is 
incorporated  in  the  incremental  eigenbasis  updates  in  [19]. 
Further,  Ross  et  al.  [34]  provided  an  analysis  of  its  effect  on 
the resulting eigenbasis. Skocaj and Leonardis [37] presented
an incremental method, which sequentially updates the principal
subspace considering weighted influence of individual
images as well as individual pixels in an image.

However, the appearance models adopted in the above-mentioned
trackers are usually sensitive to variations in illumination,
view, and pose. These tracking approaches lack
a competent object description criterion that captures both
statistical and spatial properties of the object appearance.
The covariance region descriptor (CRD) [39] was proposed
to characterize the object appearance; it is capable of
capturing the correlations among extracted features inside an
object region and is robust to some appearance variations. In
the recently proposed covariance tracking approach [33], the
Riemannian mean under the affine-invariant metric is used to
update the target model. Nevertheless, the computational cost
of the Riemannian mean grows rapidly as time progresses and
is very time-consuming for long-term tracking. Based on the
Log-Euclidean Riemannian metric [2], Li et al. [20] presented
an online subspace learning algorithm which models the
appearance changes by incrementally learning an eigenspace
representation for each mode of the target through adaptively
updating the sample mean and eigenbasis.

Our work is motivated in part by the strength of covariance
descriptors as appearance models [39], the effectiveness of
particle filters [13], and the adaptability of online update
schemes [34]. In contrast to the covariance tracking algorithm
[33], our algorithm does not require a complex model
update process on the Riemannian manifold but learns the compact
covariance tensor representation incrementally during the
object tracking process. Thus our appearance model can update
more efficiently. Further, our method uses a particle filter for
motion parameter estimation rather than the exhaustive search-based
method [33], which is very time-consuming and often
distracted by outliers. Moreover, with the help of integral
images [32], our tracker achieves real-time performance. A
preliminary conference version of this paper appears in [43].


3  Probabilistic  covariance  tracking 

In this section, we first review the covariance descriptor [39]
and the particle filter [13]; then the probabilistic covariance tracking
approach is introduced.


3.1  Covariance  descriptor 

Let $I$ be the observed image, and $F$ be the $W \times H \times d$ dimensional
feature image extracted from $I$, $F(x, y) = \phi(I, x, y)$,
where $\phi$ can be any mapping such as color, gradients, filter
responses, etc. Let $\{f_i\}_{i=1}^{N}$ be the $d$-dimensional feature points
inside a given rectangular region $R$ of $F$. The region $R$ is
represented by the $d \times d$ covariance matrix of the feature points

$$C = \frac{1}{N-1} \sum_{i=1}^{N} (f_i - \mu)(f_i - \mu)^T,$$

where $N$ is the number of pixels in the region $R$ and $\mu$ is the
mean of the feature points.
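The definition above can be checked with a few lines of NumPy. This is a minimal sketch, not the authors' C++ implementation: the function name `covariance_descriptor` and the `(H, W, d)` array layout for the region's feature image are assumptions made for illustration.

```python
import numpy as np


def covariance_descriptor(features: np.ndarray) -> np.ndarray:
    """Return the d x d covariance descriptor of a region.

    features: (H, W, d) array holding one d-dimensional feature vector
    per pixel of the region R (hypothetical layout).
    """
    h, w, d = features.shape
    pts = features.reshape(h * w, d).astype(np.float64)  # N x d feature points
    mu = pts.mean(axis=0)                                 # mean feature vector
    centered = pts - mu
    # Unbiased estimate: divide by N - 1, as in the definition of C.
    return centered.T @ centered / (pts.shape[0] - 1)
```

The result is symmetric and its size depends only on the number of features `d`, not on the region size.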


The element $(i, j)$ of $C$ represents the correlation between
feature $i$ and feature $j$. When the extracted $d$-dimensional feature
includes the pixel's coordinate, the covariance descriptor
encodes the spatial information of features.

With the help of integral images, the covariance descriptor
can be calculated efficiently [39]. Specifically, $d(d+1)/2$
integral images are used such that the covariance descriptor
of any rectangular region can be computed in time independent of the
region size.
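The integral-image idea can be sketched as follows. This is an illustrative sketch under stated assumptions: the helper names (`build_integrals`, `box_sum`, `region_covariance`) are hypothetical, and for clarity the code stores the full $d \times d$ tensor of second-order sums instead of only the $d(d+1)/2$ distinct products, alongside the first-order sums needed for the mean.

```python
import numpy as np


def build_integrals(F: np.ndarray):
    """F: (H, W, d) feature image. Returns zero-padded integral tensors
    P (sums of features) and Q (sums of feature outer products)."""
    H, W, d = F.shape
    F = F.astype(np.float64)
    P = np.zeros((H + 1, W + 1, d))
    Q = np.zeros((H + 1, W + 1, d, d))
    P[1:, 1:] = F.cumsum(axis=0).cumsum(axis=1)
    outer = F[..., :, None] * F[..., None, :]       # per-pixel f f^T
    Q[1:, 1:] = outer.cumsum(axis=0).cumsum(axis=1)
    return P, Q


def box_sum(I: np.ndarray, y0: int, x0: int, y1: int, x1: int) -> np.ndarray:
    """Sum over rows y0..y1-1 and columns x0..x1-1 using four lookups."""
    return I[y1, x1] - I[y0, x1] - I[y1, x0] + I[y0, x0]


def region_covariance(P, Q, y0, x0, y1, x1) -> np.ndarray:
    """Covariance of the rectangle in time independent of its area."""
    n = (y1 - y0) * (x1 - x0)
    p = box_sum(P, y0, x0, y1, x1)   # sum of feature vectors
    q = box_sum(Q, y0, x0, y1, x1)   # sum of outer products
    return (q - np.outer(p, p) / n) / (n - 1)
```

Once `P` and `Q` are built for a frame, every particle's candidate rectangle costs the same handful of array lookups, which is what makes evaluating many candidates per frame affordable.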

3.1.1  Metric  on  Riemannian  manifolds 
Supposing no features in the feature vector are exactly
identical, the covariance matrix is positive definite. Thus
the nonsingular covariance matrices can be formulated as a
connected Riemannian manifold, which is locally similar to
a Euclidean space. For differentiable manifolds, the derivative
at a point $X$ lies in its tangent space, denoted $T_X$. Each
tangent space has an inner product $\langle \cdot, \cdot \rangle_X$, and the norm for a
tangent vector is defined by $\|y\|_X^2 = \langle y, y \rangle_X$.

An invariant Riemannian metric on the tangent space is
defined as $\langle y, z \rangle_X = \mathrm{tr}\left(X^{-1/2}\, y\, X^{-1} z\, X^{-1/2}\right)$. The exponential
map associated with the Riemannian metric is given by
$\exp_X(y) = X^{1/2} \exp\left(X^{-1/2}\, y\, X^{-1/2}\right) X^{1/2}$. The logarithm, uniquely
defined at all the points on the manifold, is
$\log_X(Y) = X^{1/2} \log\left(X^{-1/2}\, Y X^{-1/2}\right) X^{1/2}$.

For a symmetric matrix, its exponential and logarithm
are given respectively by $\exp(\Sigma) = U \exp(D)\, U^T$ and
$\log(\Sigma) = U \log(D)\, U^T$, where $\Sigma = U D U^T$ is the eigenvalue
decomposition of the symmetric matrix $\Sigma$, and $\exp(D)$ and $\log(D)$
are the diagonal matrices of the eigenvalue exponentials and
logarithms, respectively.

The distance between symmetric positive definite matrices
is measured by

$$d^2(X, Y) = \langle \log_X(Y), \log_X(Y) \rangle_X = \mathrm{tr}\left(\log^2\left(X^{-1/2}\, Y X^{-1/2}\right)\right).$$
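The distance can be evaluated with two symmetric eigendecompositions, following the matrix-function recipe of this subsection. The sketch below is illustrative only; the names `spd_inv_sqrt` and `riemannian_distance` are assumptions, and the inputs are assumed to be symmetric positive definite.

```python
import numpy as np


def spd_inv_sqrt(X: np.ndarray) -> np.ndarray:
    """X^{-1/2} for a symmetric positive definite X via X = U D U^T."""
    vals, vecs = np.linalg.eigh(X)
    # Scale column j of U by 1/sqrt(d_j), then multiply by U^T.
    return (vecs / np.sqrt(vals)) @ vecs.T


def riemannian_distance(X: np.ndarray, Y: np.ndarray) -> float:
    """d(X, Y) = sqrt(tr(log^2(X^{-1/2} Y X^{-1/2})))."""
    S = spd_inv_sqrt(X)
    M = S @ Y @ S
    M = (M + M.T) / 2.0              # remove round-off asymmetry
    lam = np.linalg.eigvalsh(M)      # eigenvalues of the SPD matrix M
    # tr(log^2(M)) is the sum of squared log-eigenvalues.
    return float(np.sqrt(np.sum(np.log(lam) ** 2)))
```

For example, with `X` the $2 \times 2$ identity and `Y = diag(e, e^2)`, the log-eigenvalues are 1 and 2, so the distance is $\sqrt{5}$. The metric is unchanged when both matrices are transformed as $A X A^T$ and $A Y A^T$ for any invertible $A$.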


3.2  Sequential  Inference  Model 

In the Bayesian perspective, object tracking can be viewed as a
state estimation problem. At time $t$, denote the state of a target
and its corresponding observation as $x_t$ and $y_t$, respectively.
The state set from the beginning to time $t$ is $x_{0:t}$, where $x_0$ is the
initial state, and the corresponding observation set is $y_{0:t}$.

The purpose of tracking is to predict the future location
and estimate the current state given all previous observations,
or equivalently to construct the filtering distribution $p(x_t | y_{0:t})$.
Using the conditional independence properties, we can formulate
the density propagation for the tracker as follows:

$$p(x_t | y_{0:t}) \propto p(y_t | x_t) \int p(x_t | x_{t-1})\, p(x_{t-1} | y_{0:t-1})\, dx_{t-1}.$$

For visual tracking problems, the recursion can be accomplished
within a sequential Monte Carlo framework where the
posterior $p(x_t | y_{0:t})$ is approximated by a weighted sample
set $\{x_t^n, w_t^n\}_{n=1}^{N_s}$, where $\sum_{n=1}^{N_s} w_t^n = 1$. All the particles are
sampled from a proposal density $q(x_t^n | x_{t-1}^n, y_t)$. The weight
associated with each particle is formulated as follows:

$$w_t^n \propto \frac{p(y_t | x_t^n)\, p(x_t^n | x_{t-1}^n)}{q(x_t^n | x_{t-1}^n, y_t)}\, w_{t-1}^n.$$

To avoid weight degeneracy, the particles are resampled so
that all of them have equal weights after resampling.

The common choice of proposal density is
$q(x_t | x_{t-1}, y_t) = p(x_t | x_{t-1})$. As a result, the weights become
the local likelihood associated with each state, $w_t^n \propto p(y_t | x_t^n)$.
The Monte Carlo approximation of the expectation,
$\hat{x}_t = \frac{1}{N_s} \sum_{n=1}^{N_s} x_t^n \approx E(x_t | y_{0:t})$,
is used for state estimation at time $t$.
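One cycle of this bootstrap-style filter can be sketched as below. It is a generic illustration, not the authors' tracker: the function name `bootstrap_step`, the Gaussian random-walk dynamics with standard deviations `sigma`, and the user-supplied `likelihood` callable are all assumptions. The sketch reports the weighted mean before resampling, which estimates the same expectation as the equal-weight average over the resampled set.

```python
import numpy as np


def bootstrap_step(particles, likelihood, sigma, rng):
    """One predict / weight / estimate / resample cycle.

    particles:  (Ns, k) equally weighted states from the previous step
    likelihood: callable mapping a (k,) state to a value proportional
                to p(y_t | x_t)
    sigma:      (k,) standard deviations of the random-walk dynamics
    rng:        numpy.random.Generator
    """
    ns = particles.shape[0]
    # Proposal q = p(x_t | x_{t-1}): perturb each particle.
    proposed = particles + rng.normal(0.0, sigma, size=particles.shape)
    # With this proposal the weights reduce to the local likelihood.
    w = np.array([likelihood(x) for x in proposed], dtype=np.float64)
    w /= w.sum()
    estimate = w @ proposed                      # Monte Carlo posterior mean
    # Resample so that all particles carry equal weight again.
    idx = rng.choice(ns, size=ns, p=w)
    return proposed[idx], estimate, w
```

In a tracker the `likelihood` callable would score the image region associated with each candidate state; here it can be any nonnegative function of the state.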

3.3  Probabilistic  covariance  tracking 

Based on the same Manifold Covariance Similarity (MCS)
metric to compare two covariance descriptors on Riemannian
manifolds, the probabilistic covariance tracking approach
embeds the MCS metric within a sequential Monte Carlo
framework. Developing the manifold covariance approach
requires building a local likelihood on Riemannian manifolds,
coupling the observation model with a dynamical
state space model, and sequentially approximating the
posterior distribution with a particle filter. The sample-based
filtering technique enables tracking multiple posterior
modes, which is the key to mitigating the effects of background
distractors and to recovering from temporary occlusions.

Specifically, to measure the similarity between covariance
matrices corresponding to the target model $C^*$ and the candidate
$C(x_t^n)$, we use the Manifold Covariance Similarity metric
on Riemannian manifolds. An exponential function of the
distance is adopted as the local likelihood in the particle filter:

$$p(y_t | x_t^n) \propto \exp\{-\lambda\, d^2(C^*, C(x_t^n))\}.$$
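A minimal sketch of this likelihood, turning candidate covariance matrices into normalized particle weights, is given below. The function names and the default value of `lam` (standing in for $\lambda$, whose value is not stated in this section) are assumptions. The squared distance is computed from the eigenvalues of $C^{*-1} C$, which coincide with those of $C^{*-1/2}\, C\, C^{*-1/2}$.

```python
import numpy as np


def mcs_distance_sq(C_model: np.ndarray, C_cand: np.ndarray) -> float:
    """Squared manifold distance: sum of squared log-eigenvalues of C_model^{-1} C_cand."""
    lam = np.linalg.eigvals(np.linalg.solve(C_model, C_cand)).real
    return float(np.sum(np.log(lam) ** 2))


def local_likelihood(C_model, C_cand, lam: float = 1.0) -> float:
    """Unnormalized p(y | x) = exp(-lam * d^2(C_model, C_cand))."""
    return float(np.exp(-lam * mcs_distance_sq(C_model, C_cand)))


def particle_weights(C_model, candidates, lam: float = 1.0) -> np.ndarray:
    """Normalized weights for a list of candidate covariance matrices."""
    w = np.array([local_likelihood(C_model, C, lam) for C in candidates])
    return w / w.sum()
```

A candidate identical to the model receives the largest unnormalized likelihood (exactly 1), and the likelihood decays as the manifold distance grows; `lam` controls how sharply.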

4  Incremental  Covariance  Tensor 
Learning  for  Model  Update 

The main challenge of visual tracking can be attributed to the
difficulty in handling the appearance variability of a candidate
object. To address the model update problem, we present a
model update scheme that incrementally learns a low-dimensional
covariance tensor representation and consequently adapts online
to appearance changes with constant computational
complexity. Moreover, a weighting scheme is adopted to ensure
that less modeling power is expended to fit older observations
with existing models. Both of these features contribute
significantly to improving overall real-time tracking performance.
In the following, we provide a detailed discussion of our
proposed Incremental Covariance Tensor Learning (ICTL)
algorithm for model update.

4.1  Object  representation 

In our tracking framework, an object is represented by multiple
covariance matrices of the image features inside the object region,
as shown in Fig. 2. These covariance matrices correspond
to the multiple modes of the object appearance. Without loss
of generality, we only discuss one mode in the following.

As time progresses from $t = 1, \ldots, T$, all the object
appearances form an object appearance tensor $\mathcal{A} = \{A_t \in \mathbb{R}^{m \times n}\}_{t=1}^{T}$,
and a $d$-dimensional feature vector is extracted
for each element of $A_t$, forming a 4th-order object feature
tensor $\mathcal{F} \in \mathbb{R}^{m \times n \times d \times T}$. Flattening $\mathcal{F}$, we can obtain the
matrix comprising its mode-3 vectors (i.e., each column is a
$d$-dimensional feature vector):

$$F = (f_{1,1,1}\ f_{1,1,2} \cdots f_{1,2,1} \cdots f_{2,1,1} \cdots f_{t,y,x} \cdots f_{T,m,n}),$$

where $f_{t,y,x}$ denotes a $d$-dimensional feature vector at location
$(x, y)$ at time $t$. Reforming $x$ and $y$ into one index $i$, $F$ can
be represented neatly by

$$F = (f_{1,1} \cdots f_{1,N} \cdots f_{t,i} \cdots f_{T,N}) = (F_1 \cdots F_t \cdots F_T),$$

where $N = m \times n$ and $F_t = (f_{t,1} \cdots f_{t,i} \cdots f_{t,N}) \in \mathbb{R}^{d \times (m \times n)}$.

Fig. 2. Illustration of object representation, the flattening
of $\mathcal{F}$ and two different formulations for $\bar{C}_T$. The input
sequence is shown in the upper part of (a) while the fourth-order
object feature tensor $\mathcal{F}$ is displayed in the middle of
(a). The result of flattening $\mathcal{F}$ is exhibited in the bottom
of (a). The appearance tensor $\mathcal{A}$ with mode division is
shown in the top of (b) while the covariance tensor for
one mode is shown in the middle of (b). The bottom of (b) displays
two different formulations for $\bar{C}_T$.

The column covariance of $F_t$ can be represented as:

$$C_t = \frac{1}{N-1} \sum_{i=1}^{N} (f_{t,i} - \mu_t)(f_{t,i} - \mu_t)^T,$$

where $\mu_t$ is the column mean of $F_t$. This covariance can be
viewed as an informative region descriptor for an object [39].
All the covariance matrices up to time $T$, $\{C_t \in \mathbb{R}^{d \times d}\}_{t=1}^{T}$,
constitute a covariance tensor $\mathcal{C} \in \mathbb{R}^{d \times d \times T}$. We need to track
the changes of $\mathcal{C}$ and, as new data arrive, update the compact
representation of $\mathcal{C}$.

A straightforward compact representation of $\mathcal{C}$ is the mean
of $\{C_t \in \mathbb{R}^{d \times d}\}_{t=1}^{T}$. Porikli et al. [33] calculated the mean
of several covariance matrices through Riemannian geometry.
The metric they used is the affine-invariant Riemannian metric.
The distance between two covariance matrices $X$ and $Y$ under
this Riemannian metric is computed by $\|\log(X^{-1/2}\, Y X^{-1/2})\|$.
An equivalent form is given in [9]:

$$\rho(X, Y) = \sqrt{\sum_{k=1}^{d} \ln^2 \lambda_k(X, Y)}, \qquad (1)$$

where $\lambda_k(X, Y)$ are the generalized eigenvalues of $X$ and
$Y$. Under this metric, an iterative numerical procedure [30]
is applied to compute the Riemannian mean. The computational
cost for this Riemannian mean grows linearly as time
progresses. In the following, we propose a novel compact
representation of $\mathcal{C}$, which can be updated in constant time
by avoiding the computation of the Riemannian mean.


4.2  Incremental  Covariance  Tensor  Learning 

From a generative perspective, $\mu_t$ and $C_t$ are generated from
$F_t$, and the covariance tensor $\mathcal{C}$ is generated from the feature
tensor $\mathcal{F}$. Therefore, the compact tensor representation can be
obtained directly from $\mathcal{F}$. We get the compact representation
by computing the column covariance of $F$:

$$\bar{C}_T = \frac{1}{NT - 1} \sum_{t=1}^{T} \sum_{i=1}^{N} (f_{t,i} - \bar{\mu}_T)(f_{t,i} - \bar{\mu}_T)^T,$$

where $\bar{\mu}_T$ is the column mean of $F$. Although (4.2) is arguably
straightforward, it is computationally expensive and needs a
large amount of memory to store all the previous observations.
Here, we propose a novel formulation that can be computed
efficiently with only $O(d^2)$ arithmetic operations.

We treat (4.2) as a sample covariance estimation problem
by considering each column $f_{t,i}$ of $F$ as a sample. As time
progresses, the sample set $F$ grows and our aim is to incrementally
update the sample covariance. In order to moderate the
balance between old and new observations, each sample $f_{t,i}$
is associated with a weight, allowing newer samples to have a
larger influence on the estimation of the current covariance
tensor representation than the older ones. As a result, the
covariance estimation problem can be reformulated as estimating
the weighted sample covariance of $F$. Furthermore, under
formulation (4.2), it is unnecessary to normalize the object
appearance to the same size as in [20]. In the following, we use
$N_t$ to denote the size of the object at time $t$.

One of the critical issues for our formulation is the design
of the sample weight. Four issues are considered in choosing the
sample weight: 1) the weight of each sample should vary over
time $T$; 2) the samples from the current time $T$ should have
higher weights than previous samples; 3) the weight should not
affect the fast covariance computation using integral images;
and 4) the weight should not affect the ability to incrementally
obtain the covariance tensor representation. Therefore, when
the current time is $T$, the sample weight at time $t$ is set as
$w^{T-t}$, where $w \in [0, 1]$ and $t \in [1, T]$. With this weight setting,
the samples at the same time share the same weight and
the weighted sample covariance of $F$ can be incrementally
updated.
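The reason this weight choice permits an incremental update is that moving from time $T$ to $T+1$ multiplies every existing weight by the same factor $w$ and gives the new frame weight 1. A small sketch (the function name `sample_weights` is hypothetical) makes this concrete:

```python
def sample_weights(T: int, w: float) -> list:
    """Per-frame weights w^(T - t) for t = 1..T at current time T."""
    return [w ** (T - t) for t in range(1, T + 1)]


# With w = 0.5 the newest frame has weight 1 and older frames decay geometrically.
weights_T3 = sample_weights(3, 0.5)   # [0.25, 0.5, 1.0]
weights_T4 = sample_weights(4, 0.5)   # [0.125, 0.25, 0.5, 1.0]
```

Because all samples within one frame share a single weight, the per-frame covariance $C_t$ can still be obtained from unweighted integral images, which is requirement 3) above.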

To  obtain  an  efficient  algorithm  to  update  the  covariance 
tensor  representation,  we  put  forward  the  following  definitions 
and  theorem. 


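Definition 1. Denote the weighted samples up to the current time
$T$ as

$$\mathcal{F}_T = \{(f_{t,i},\, w_{T,t,i})\}, \quad t = 1, \ldots, T,\ \ i = 1, \ldots, N_t,$$

where $w_{T,t,i}$ is the weight of sample $f_{t,i}$. Let the number of
samples in $\mathcal{F}_T$ be $\bar{N}_T$ and the sum of weights in $\mathcal{F}_T$ be $\bar{w}_T$,
namely $\bar{N}_T = \sum_{t=1}^{T} N_t$ and $\bar{w}_T = \sum_{t=1}^{T} \sum_{i=1}^{N_t} w_{T,t,i}$.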

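Definition 2. Let $C_t$, $\mu_t$ be the weighted covariance and
the weighted sample mean at time $t$, respectively. Denote the
weighted covariance and the weighted sample mean of $\mathcal{F}_T$ as
$\bar{C}_T$ and $\bar{\mu}_T$, respectively. The formulations of $\bar{C}_T$ and $\bar{\mu}_T$ are
as follows:

$$\bar{C}_T = \frac{1}{1 - \tilde{w}_T^2} \sum_{t=1}^{T} \sum_{i=1}^{N_t} \frac{w_{T,t,i}}{\bar{w}_T} (f_{t,i} - \bar{\mu}_T)(f_{t,i} - \bar{\mu}_T)^T, \qquad (2)$$

where

$$\tilde{w}_T^2 = \sum_{t=1}^{T} \sum_{i=1}^{N_t} \left(\frac{w_{T,t,i}}{\bar{w}_T}\right)^2, \qquad \bar{\mu}_T = \frac{1}{\bar{w}_T} \sum_{t=1}^{T} \sum_{i=1}^{N_t} w_{T,t,i}\, f_{t,i}.$$

Letting the weights of all samples at time $t$ be equal, the formulations
of $C_t$ and $\mu_t$ are as follows:

$$C_t = \frac{1}{N_t - 1} \sum_{i=1}^{N_t} (f_{t,i} - \mu_t)(f_{t,i} - \mu_t)^T, \qquad \mu_t = \frac{1}{N_t} \sum_{i=1}^{N_t} f_{t,i}.$$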

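Theorem 1. Given $C_T$, $\mu_T$, $\bar{C}_{T-1}$, $\bar{\mu}_{T-1}$, $\bar{w}_{T-1}$, and $\tilde{w}_{T-1}^2$, if
$w_{T,t,i} = w^{T-t}$, $w \in [0, 1]$, it can be shown that:

$$\bar{C}_T = \frac{1}{\bar{w}_T (1 - \tilde{w}_T^2)} \left\{ w \bar{w}_{T-1} (1 - \tilde{w}_{T-1}^2)\, \bar{C}_{T-1} + (N_T - 1)\, C_T + \frac{w \bar{w}_{T-1} N_T}{\bar{w}_T} (\mu_T - \bar{\mu}_{T-1})(\mu_T - \bar{\mu}_{T-1})^T \right\}, \qquad (3)$$

where $\bar{w}_T = w \bar{w}_{T-1} + N_T$, $\bar{\mu}_T = \frac{w \bar{w}_{T-1}}{\bar{w}_T} \bar{\mu}_{T-1} + \frac{N_T}{\bar{w}_T} \mu_T$, and
$\tilde{w}_T^2 = \left(\tilde{w}_{T-1}^2\, \bar{w}_{T-1}^2\, w^2 + N_T\right) / \bar{w}_T^2$. The initial conditions are
$\bar{C}_1 = C_1$, $\bar{\mu}_1 = \mu_1$, $\bar{w}_1 = N_1$, and $\tilde{w}_1^2 = 1/N_1$.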
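Theorem 1 can be checked numerically with a short sketch. The class below is an illustration under stated assumptions, not the authors' implementation: the name `ICTLModel` and the `(d, N_t)` layout of each frame's feature matrix are hypothetical. Instead of storing the normalized quantity $\tilde{w}_T^2$, the sketch tracks the raw sum of squared weights $S_T = w^2 S_{T-1} + N_T$, from which $\tilde{w}_T^2 = S_T / \bar{w}_T^2$ follows by definition (2).

```python
import numpy as np


class ICTLModel:
    """Running weighted covariance of all samples seen so far (one mode)."""

    def __init__(self, F1: np.ndarray, w: float = 0.95):
        # F1: (d, N_1) feature matrix of the first frame, one column per pixel.
        n = F1.shape[1]
        self.w = w
        self.mean = F1.mean(axis=1)   # weighted mean, initially mu_1
        self.cov = np.cov(F1)         # weighted covariance, initially C_1
        self.w_sum = float(n)         # sum of weights, initially N_1
        self.w_sq = float(n)          # raw sum of squared weights S_1 = N_1

    def update(self, FT: np.ndarray) -> np.ndarray:
        """Fold in the (d, N_T) feature matrix of the newest frame."""
        n = FT.shape[1]
        mu_t = FT.mean(axis=1)
        c_t = np.cov(FT)
        w, w_prev = self.w, self.w_sum
        # Previous normalizer: w_sum * (1 - S / w_sum^2).
        norm_prev = w_prev - self.w_sq / w_prev
        w_sum = w * w_prev + n
        w_sq = w * w * self.w_sq + n
        diff = mu_t - self.mean
        scatter = (w * norm_prev * self.cov
                   + (n - 1) * c_t
                   + (w * w_prev * n / w_sum) * np.outer(diff, diff))
        self.cov = scatter / (w_sum - w_sq / w_sum)
        self.mean = (w * w_prev * self.mean + n * mu_t) / w_sum
        self.w_sum, self.w_sq = w_sum, w_sq
        return self.cov
```

Each update touches only $d \times d$ quantities, so its cost does not depend on how many frames have been processed. Frames may have different numbers of samples, consistent with the remark that the object need not be normalized to a fixed size.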


To  make  the  proof  of  Theorem  1  concise,  we  give  some 
lemmas  first.  The  proof  of  all  the  lemmas  appears  in  the 
Appendix. 


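Lemma 1. If $w_{T,t,i} = w^{T-t}$, $w \in [0, 1]$, we have $\bar{w}_T = w \bar{w}_{T-1} + N_T$ and
$\tilde{w}_T^2 = \left(\tilde{w}_{T-1}^2\, \bar{w}_{T-1}^2\, w^2 + N_T\right) / \bar{w}_T^2$.

Lemma 2. $\sum_{t=1}^{T} \sum_{i=1}^{N_t} w_{T,t,i} (f_{t,i} - \bar{\mu}_T) = 0$ and
$\sum_{t=1}^{T} \sum_{i=1}^{N_t} w_{T,t,i} (f_{t,i} - \bar{\mu}_T)^T = 0$.

Lemma 3. If weights of all the samples at time $T$ are equal,
then $\sum_{i=1}^{N_T} (f_{T,i} - \bar{\mu}_T)(f_{T,i} - \bar{\mu}_T)^T = (N_T - 1)\, C_T + N_T (\mu_T - \bar{\mu}_T)(\mu_T - \bar{\mu}_T)^T$.

Lemma 4. If $w_{T,t,i} = w^{T-t}$, $w \in [0, 1]$, we have
$\bar{\mu}_T = \frac{w \bar{w}_{T-1}}{\bar{w}_T} \bar{\mu}_{T-1} + \frac{N_T}{\bar{w}_T} \mu_T$,
$\bar{\mu}_{T-1} - \bar{\mu}_T = -\frac{N_T}{\bar{w}_T} (\mu_T - \bar{\mu}_{T-1})$, and
$\mu_T - \bar{\mu}_T = \frac{w \bar{w}_{T-1}}{\bar{w}_T} (\mu_T - \bar{\mu}_{T-1})$.

Lemma 5. If $w_{T,t,i} = w^{T-t}$, $w \in [0, 1]$, we have
$\sum_{t=1}^{T-1} \sum_{i=1}^{N_t} w_{T,t,i} (f_{t,i} - \bar{\mu}_T)(f_{t,i} - \bar{\mu}_T)^T = w \bar{w}_{T-1} (1 - \tilde{w}_{T-1}^2)\, \bar{C}_{T-1} + w \bar{w}_{T-1} (\bar{\mu}_{T-1} - \bar{\mu}_T)(\bar{\mu}_{T-1} - \bar{\mu}_T)^T$.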

Proof  of  Theorem  1: 

T  N 

By  definition,  CT  =  ^E  E  (hi  ~  At)  (hi  ~  At) 

thus  we  have 

u)t(1  —  Wji)Ct 
T  Nt 

=  EE  WT,t,i(hi  ~  A r)(hi  -  At)T 

t=l  i=l 
T— 1  JVt 

=  EE  WT,t,i(hi  At )  (,fl,i  ~  At)T 

t=l  i=l 


Algorithm  1  The  incremental  covariance  tensor  learning  al¬ 
gorithm 


T 


l:  Given  Ct-,  Pt-,  Nt-,  Ct— i?  At— i5  ^t— i5  At— i,  L)t—1’  as 
well  as  wr,t,i  =  wT~t,w  G  [0, 1],  compute  CV: 

2:  Update  the  sum  of  sample  weights  up  to  time  T:  u)t  = 
,  wwt-i  +  At; 


3:  Update  the  squared  sum  of  normalized  sample  weights  up 
to  time  T:  w\  —  ((w‘^_1w‘^_1  —  AT-i)rc2  +  At)  /dj\\ 
4:  Update  the  weighted  mean  of  all  samples  up  to  time  T: 


Pt  = 


lAt-i  +  r: 


5:  Update  the  weighted  covariance  Ct  by  Theorem  1. 

6:  The  initial  conditions  are  C%  =  Ci,  Ai  =  yu  f&i  =  Ai, 
and  u)2  =  1/Ai. 


Nt 

+  ^  WT,t,i(fT,i  ~  At)(/t,z  -  At)T  (Lemmas  3  and  5) 

=  wwt— i(1  —  )Ct— i 

+  icCt-i  (At-i  —  At)(At-i  —  At)T  (Lemma  4) 

+  (At  —  1)Ct  +  Nt(ht  —  At)(/^t  —  At)T 
=  u;u)t-i(1  — 


5  Experiments 

In  our  experiments,  the  target  is  initialized  manually.  The 
tracking  parameters  are  tuned  on  one  sequence  and  applied 
to  all  the  other  sequences.  During  the  visual  tracking,  a  7- 
dimensional  feature  vector  is  extracted  for  each  pixel: 

(x,  y ,  R(x ,  yf  G{x ,  t/)  ,  T?(x ,  t/)  ,  Ix  (x,  yfly  (x,  ?/)) , 


+  (At  —  1)Ct  +  '^’'4)t-i(^-)2(mt  —  At-i)(mt  —  At-i) 

wt 

+  AT(TO~1)(/iT  -  At-i)(Mt  -  At-i)T  (Lemma  1) 
Wt 

=  wwt— i(1  —  )Ct— i  +  (At  —  1)Ct 


wwt—iNt 

wt 


(yr  —  At-i)(mt  —  At-i)T  • 


□ 

If  we  treat  all  samples  equally,  i.e.,  set  w  to  1,  we  can  obtain 
the  sample  covariance  of  F  from  (3): 


Ct 


At  —  1 
At  At 


+ 


At 


{(At— i  —  1)Ct— i  +  (At  —  1)Ct 
-(/iT  —  At- i)(yr  —  At-i)T} 


where (x, y) is the pixel location, R, G, B are the RGB color values, and $I_x$, $I_y$ are the intensity derivatives. Consequently, the covariance descriptor of a color image region is a 7 × 7 symmetric matrix. The state in the particle filter refers to an object's 2D location and scale, namely (x, y, s). The state dynamics $p(x_t \mid x_{t-1})$ is assumed to be a Gaussian distribution $N(x_t; x_{t-1}, \Sigma)$, where $\Sigma$ is a diagonal covariance matrix whose diagonal elements are $(\sigma_x^2, \sigma_y^2, \sigma_s^2) = (5^2, 5^2, 0.02^2)$. The number of particles is set to 100 for our tracker and w in (3) is set to 0.95. The observation model $p(y_t \mid x_t)$ is the crucial part for finding the ideal posterior distribution. It reflects the similarity between a candidate sample and the learned compact covariance tensor representation. The target appearance model is represented by M modes. Each mode $C_i(x_t)$ of the candidate sample $x_t$ is compared with the corresponding model by (1). Thus $p(y_t \mid x_t)$ can be formulated as:


When w is set to 0, $\bar{C}_T$ is equal to $C_T$, which means that only information at the current time is used to represent the covariance tensor.

Expanding $\bar{C}_{T-1}$ in Theorem 1 iteratively, we can reformulate $\bar{C}_T$ as follows:

T  T 

Ct  =  E  wt,cCt  +  ][>,>*  -  At-i)(Mt  -  At- i)T- 

t=  1  t= 2 


where  wt  c  = 


_  wT{Nt-l) 


_  w  wt—iNt 


—  w't)  ’  Wt~1WtWT(^  —  wf)  * 


It is interesting to see that our formulation is a mixture model, namely a weighted sum of all the covariances up to time T plus a regularization term, in which the weight of each kernel covariance is adapted dynamically.

Consequently,  the  proposed  incremental  covariance  tensor 
learning  algorithm  is  shown  in  Algorithm  1. 


$$p(y_t \mid x_t) \propto \exp\Big\{-\sum_{i=1}^{M} \omega_i\, \rho^2\big[\bar{C}_{T,i},\, C_i(x_t)\big]\Big\},$$

where $\omega_i$ is the weight for the i-th mode ($\omega_i = 1/M$ in our experiments). After the MAP estimation, we use the
covariance  matrices  of  image  features  associated  with  the 
estimated  target  state  to  update  the  compact  covariance  tensor 
model  for  each  mode. 
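
The particle propagation and observation model described above can be sketched as follows. This is only an illustrative sketch, not the authors' C++ implementation: it assumes the likelihood decays as the exponential of the negative weighted sum of squared per-mode distances, it takes those distances as already computed by metric (1), and it reads "MAP estimation" as picking the highest-scoring particle. The names `propagate`, `likelihood`, and `map_estimate` are hypothetical.

```python
import math
import random

# Standard deviations of the (x, y, s) random walk quoted in the text.
SIGMA = (5.0, 5.0, 0.02)


def propagate(particles, rng, sigma=SIGMA):
    """Draw x_t ~ N(x_{t-1}, Sigma) for every particle (diagonal Sigma)."""
    return [
        tuple(value + rng.gauss(0.0, sd) for value, sd in zip(state, sigma))
        for state in particles
    ]


def likelihood(distances, weights=None):
    """Unnormalized p(y_t | x_t) from the per-mode distances rho_i.

    Assumed form: exp(-sum_i omega_i * rho_i^2).
    """
    m = len(distances)
    if weights is None:
        weights = [1.0 / m] * m  # omega_i = 1/M, as in the experiments
    return math.exp(-sum(w * d * d for w, d in zip(weights, distances)))


def map_estimate(particles, per_particle_distances):
    """Return the particle with the largest likelihood and its score."""
    scores = [likelihood(d) for d in per_particle_distances]
    best = max(range(len(particles)), key=lambda k: scores[k])
    return particles[best], scores[best]


if __name__ == "__main__":
    rng = random.Random(0)
    # 100 particles, all started at a hypothetical initial state (x, y, s).
    particles = propagate([(160.0, 120.0, 1.0)] * 100, rng)
    print(len(particles), particles[0])
```

In a full tracker the distances passed to `map_estimate` would come from comparing each mode of each candidate region with the learned model; here they are left as inputs so that the sketch stays independent of any particular eigenvalue solver.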

By our definition, each particle corresponds to an upright rectangle. Therefore, the cost of covariance computation can be reduced using the integral histogram technique [32]. After constructing integral images for each feature dimension and for the product of any two feature dimensions, the covariance matrix of an arbitrary rectangular region can be computed in time independent of the region size. In our case, 28 integral images are constructed for fast covariance computation. The approach was implemented in C++ and run on a PC with a 1.6-GHz CPU.
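
The integral-image idea can be illustrated with a small pure-Python sketch. It is not the authors' implementation; the function names are hypothetical and the feature planes are plain nested lists. For d features it builds d first-order tables and d(d+1)/2 product tables (28 product tables for d = 7), after which the covariance of any rectangle needs only four lookups per table.

```python
def integral_image(values):
    """Summed-area table with an extra zero row and column."""
    h, w = len(values), len(values[0])
    table = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0.0
        for x in range(w):
            row_sum += values[y][x]
            table[y + 1][x + 1] = table[y][x + 1] + row_sum
    return table


def rect_sum(table, x0, y0, x1, y1):
    """Sum over the half-open rectangle [x0, x1) x [y0, y1)."""
    return table[y1][x1] - table[y0][x1] - table[y1][x0] + table[y0][x0]


def build_tables(features):
    """features: list of d planes, each an h x w nested list of floats."""
    d = len(features)
    h, w = len(features[0]), len(features[0][0])
    first = [integral_image(f) for f in features]
    second = {}
    for k in range(d):
        for l in range(k, d):
            product = [
                [features[k][y][x] * features[l][y][x] for x in range(w)]
                for y in range(h)
            ]
            second[(k, l)] = integral_image(product)
    return first, second


def region_covariance(first, second, x0, y0, x1, y1):
    """Sample covariance of the features inside a rectangle (n >= 2 pixels)."""
    d = len(first)
    n = (x1 - x0) * (y1 - y0)
    sums = [rect_sum(t, x0, y0, x1, y1) for t in first]
    cov = [[0.0] * d for _ in range(d)]
    for k in range(d):
        for l in range(k, d):
            q = rect_sum(second[(k, l)], x0, y0, x1, y1)
            cov[k][l] = cov[l][k] = (q - sums[k] * sums[l] / n) / (n - 1)
    return cov


if __name__ == "__main__":
    # Two toy feature planes on a 3 x 4 image: the x and y coordinates.
    feats = [
        [[float(x) for x in range(4)] for _ in range(3)],
        [[float(y) for _ in range(4)] for y in range(3)],
    ]
    first, second = build_tables(feats)
    print(region_covariance(first, second, 0, 0, 4, 3))
```

The tables are built once per frame, so the per-particle cost depends only on d and not on the rectangle size.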


Fig.  3.  Speed  comparison  for  model  update. 

Without  code  optimization,  our  tracker  can  achieve  around  20 
fps  for  image  sequences  with  resolution  320  x  240. 

We compared the proposed ICTL tracker with nine state-of-the-art visual trackers, namely, the generalized kernel-based tracker (GKT) [35], the multi-instance learning based tracker (MIL) [5], the incremental PCA based tracker (IVT) [34], the online boosting based tracker (OAB) [11], the visual tracking decomposition tracker (VTD) [18], the fragments based tracker (Frag) [1], the color based particle filtering tracker (CPF) [31], the covariance tracker (COV) [33], and the Mean Shift tracker (MS) [8]. For the publicly available trackers, we used the same parameters as the authors. Eleven sequences, most of which have been widely tested before, are used in the comparison experiments. The quantitative results are summarized in Tables 1, 2, and 3 and Fig. 10. A more detailed discussion of the comparative tracking results follows.

5.1  Speed  comparison  for  model  update 

From (3), it is clear that the update for $\bar{C}_T$ is independent of T and needs only $O(d^2)$ arithmetic operations, while the computational complexity of the Riemannian mean used in [33] is $O(Td^3)$. In our experimental setting, with T = 50 and d = 7, the computation times of the two algorithms are 0.1 ms and 10 ms, respectively.

The  computation  times  for  model  update  are  given  in  Fig.  3 
in  log-linear  scale.  The  figure  shows  that  the  proposed  ICTL 
has  a  constant  time  complexity  and  is  significantly  faster  than 
the  original  covariance  tracker. 
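
The constant-time behaviour can be illustrated with the equal-weight special case (w = 1). The sketch below is not the weighted update (3) itself; it uses the standard pooled-covariance identity, and the names `batch_stats` and `merge_covariance` are hypothetical. Merging a new frame's statistics into the running model touches only d × d entries, regardless of how many frames were absorbed before.

```python
def batch_stats(samples):
    """Return (n, mean, covariance) of a list of d-dimensional samples."""
    n, d = len(samples), len(samples[0])
    mean = [sum(s[k] for s in samples) / n for k in range(d)]
    cov = [
        [
            sum((s[k] - mean[k]) * (s[l] - mean[l]) for s in samples) / (n - 1)
            for l in range(d)
        ]
        for k in range(d)
    ]
    return n, mean, cov


def merge_covariance(n1, mean1, cov1, n2, mean2, cov2):
    """Combine two (n, mean, cov) summaries in O(d^2) operations.

    Pooled identity (equal weights only):
    (n - 1) C = (n1 - 1) C1 + (n2 - 1) C2 + (n1 n2 / n) delta delta^T.
    """
    d = len(mean1)
    n = n1 + n2
    delta = [b - a for a, b in zip(mean1, mean2)]
    mean = [a + n2 * dl / n for a, dl in zip(mean1, delta)]
    scale = n1 * n2 / n
    cov = [
        [
            ((n1 - 1) * cov1[k][l] + (n2 - 1) * cov2[k][l]
             + scale * delta[k] * delta[l]) / (n - 1)
            for l in range(d)
        ]
        for k in range(d)
    ]
    return n, mean, cov


if __name__ == "__main__":
    old = [(1.0, 2.0), (2.0, 1.0), (3.0, 5.0)]
    new = [(4.0, 4.0), (0.0, 1.0), (2.0, 2.0), (5.0, 3.0)]
    print(merge_covariance(*batch_stats(old), *batch_stats(new)))
    print(batch_stats(old + new))  # same summary, recomputed from scratch
```

Only the running count, mean, and covariance need to be stored, which is why no history of past covariance matrices is required in this special case.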

5.2  Qualitative  Evaluation 

Pedestrian tracking. We first test our ICTL algorithm on pedestrian tracking using the sequences crossing, couple, jogging, subway, and woman.

Fig. 4(a) shows the comparative results on crossing. Although the target has color features similar to those of the background, our tracker is able to track the target well, which can be attributed to the descriptive power of the covariance feature and the model update scheme. Notice that the non-convex target is localized within a rectangular window, and thus its appearance representation inevitably contains some background pixels. From #48, the target rectangular window contains some light pixels. The weighted incremental model update adapts the target model to the background changes. The results show that our algorithm faithfully models the appearance of an arbitrary object in the presence of noisy background pixels.

Fig. 4(b) shows the tracking results on the sequence couple, captured from a hand-held camera. The couple sequence represents a group-tracking situation in which one or more objects move together. Notice that there is a large scale variation in the target relative to the camera (#3, #139). Even with the significant camera motion and low frame rate, our ICTL algorithm tracks the target better than the other trackers (see Table 1). Although our tracker loses the target in #91 due to the sudden fast camera motion, it re-detects the target in #116 and tracks it to the end. Furthermore, the compact tensor representation is constructed from scratch and is updated to reflect the appearance variation of the target.

Fig. 4(c) shows the tracking results on the sequence jogging. Note that our ICTL method is able to track the target undergoing gradual scale changes (#22, #300). Further, our method is able to track the target through a severe full occlusion (#68, #77), which lasts around 20 frames. Compared with COV, our method efficiently learns a compact representation while tracking the target without using Riemannian means. Moreover, our tracker is more stable when the target is under occlusion. The multi-mode representation and Bayesian formulation contribute to the successful performance.

Our algorithm is also able to track objects in cluttered environments, such as the sequence of a human walking in the subway, shown in Fig. 4(d). Despite the many similar objects in the scene and texture features that are hard to distinguish from the background, our algorithm is able to track the human well.

Sequence woman, shown in Fig. 4(e), contains a woman moving under varying occlusion, scale, and lighting conditions. Once initialized in the first frame, our algorithm is able to track the target object as it experiences long-term partial occlusions (#68, #146, #324), large scale variation (#540), and sudden global lighting variation (#45, #46). Notice that some parts of the target are occluded, and thus its appearance model inevitably contains some background information. The multi-mode representation enables the tracker to work stably and estimate the target location correctly.

Vehicle tracking. Sequence race, shown in Fig. 5(a), contains a car moving with varying scale and pose, where the background has a color similar to that of the target. Once initialized in the first frame, our tracker is able to follow the target object as it experiences large scale changes (#4, #64, #254) and pose variations (#4, #185). Notice that the COV tracker cannot handle scale changes and is not stable during the tracking sequence.

Fig. 5(b) shows the tracking results on the sequence car. The target undergoes long-term partial occlusions (#165, #170), which last around 40 frames, and large scale variation (#16, #197). In this sequence, GKT loses the target quickly and none of the other trackers estimates the scale as well as the ICTL method. When the car changes its pose (#252) together with scale variation, only our tracker can follow the target. The tracking success under partial occlusions and scale variation results from the part-based representation and the proposed model update approach.

Fig.  5(c)  shows  the  tracking  results  on  the  sequence  turn- 


Sequence   GKT [35]  MIL [5]   MS [8]    CPF [31]  COV [33]  IVT [34]  OAB [11]  VTD [18]  Frag [1]  ICTL
car        12.7242   1.5211    3.6382    4.5717    3.3430    6.0876    3.2358    3.0916    2.6072    0.9118
dog        0.1933    0.1350    0.1757    0.0946    0.2124    0.7230    0.1843    0.1372    0.1434    0.1671
face       0.1490    0.2798    0.1877    0.2757    0.4031    0.1611    0.2026    0.1897    0.1005    0.1256
race       0.0768    0.0541    0.0784    0.0931    0.2537    0.0317    0.0523    0.0320    0.0533    0.0456
turnpike   0.0213    0.0210    0.0271    0.3961    0.2563    0.0080    0.0091    0.0051    0.0168    0.0127
noise      0.4706    0.0199    0.0539    0.3000    0.1406    0.0081    0.0065    0.0061    0.0258    0.0209
crossing   0.4564    0.0196    0.0351    0.2254    0.0883    0.1902    0.0110    0.0974    0.2359    0.0144
couple     1.3426    1.0522    3.4404    2.6670    0.4898    2.1280    3.0110    2.5734    1.0009    0.3433
jogging    0.3069    0.8211    0.7028    0.1885    0.0865    0.7808    0.0570    0.7916    0.6383    0.0364
woman      0.6305    0.6972    0.5714    0.2813    0.3178    0.5846    0.6700    0.7337    0.0664    0.0366
subway     3.0896    0.1061    3.0340    0.5036    0.2772    2.9237    3.2289    3.2080    0.1577    0.0880
Ave.       1.7692    0.4297    1.0859    0.8725    0.5335    1.2388    0.9699    0.9878    0.4588    0.1639

TABLE 1

The tracking error. The error is measured using the Euclidean distance between the two center points, normalized by the size of the target from the annotation.


Sequence   GKT [35]  MIL [5]   MS [8]    CPF [31]  COV [33]  IVT [34]  OAB [11]  VTD [18]  Frag [1]  ICTL
car        0.0176    0.3939    0.2924    0.2186    0.2791    0.4664    0.3978    0.4224    0.3902    0.5547
dog        0.2876    0.3423    0.3187    0.3753    0.2665    0.1865    0.2939    0.3962    0.3524    0.3087
face       0.7346    0.5792    0.6714    0.5800    0.5138    0.6901    0.6572    0.6301    0.7822    0.7118
race       0.4984    0.5236    0.4784    0.4275    0.3516    0.6430    0.5334    0.6906    0.5216    0.6372
turnpike   0.6568    0.6506    0.6344    0.1643    0.3582    0.7780    0.7628    0.8118    0.7129    0.7560
noise      0.2112    0.6580    0.4873    0.1969    0.5343    0.7985    0.7856    0.7828    0.6250    0.6567
crossing   0.0133    0.6078    0.4876    0.0836    0.3285    0.2696    0.6258    0.4076    0.2941    0.5947
couple     0.3422    0.4396    0.0517    0.0363    0.4337    0.2129    0.0675    0.0647    0.2317    0.5125
jogging    0.3449    0.1761    0.1449    0.4141    0.5369    0.1339    0.5333    0.1694    0.1643    0.6838
woman      0.0202    0.0767    0.0521    0.0856    0.0996    0.0641    0.0740    0.0661    0.5455    0.5938
subway     0.1146    0.5724    0.0540    0.1623    0.3644    0.0707    0.0808    0.0781    0.5404    0.5589
Ave.       0.2947    0.4564    0.3339    0.2495    0.3697    0.3921    0.4375    0.4109    0.4691    0.5972

TABLE 2

The tracking quality. The quality is measured using the area coverage between the tracking result and the annotation.


Sequence   #frame  GKT [35]  MIL [5]   MS [8]    CPF [31]  COV [33]  IVT [34]  OAB [11]  VTD [18]  Frag [1]  ICTL
car        252     246       118       123       188       163       88        118       102       127       36
dog        127     89        71        77        69        91        95        96        56        73        80
face       890     0         201       0         140       214       0         10        2         0         0
race       320     33        22        52        46        132       0         22        0         27        43
turnpike   290     0         0         0         235       142       0         0         0         14        0
noise      290     190       0         48        202       60        0         0         0         14        0
crossing   120     118       0         7         114       63        71        2         44        68        0
couple     140     58        44        128       133       39        94        128       128       95        25
jogging    300     119       231       232       82        43        236       35        231       232       1
woman      542     531       478       510       483       503       474       474       490       84        48
subway     154     125       3         147       127       70        137       136       137       15        8
Total      3425    1509      1168      1324      1819      1520      1195      1021      1190      749       241

TABLE 3

Failed tracking statistics. The number for each sequence is the count of frames whose area coverage between the tracking result and the ground truth falls below a threshold (1/3 is used to generate this table).


pike.  The  color  based  CPF  tracker  drifts  off  the  target  from  #60 
and  then  quickly  loses  the  target.  Similarly,  the  COV  tracker 
also  loses  the  target  and  is  attracted  to  the  nearby  car  with 
similar  color. 

Noise. To test the robustness to noise, Gaussian noise was added to sequence turnpike, and the generated sequence is named noise. The comparative results are shown in Fig. 5(d). Compared with Fig. 5(c), the performance of GKT decreases dramatically because its appearance model is not robust to the noise. Note that the covariance descriptor is robust to Gaussian noise, and the performance of our tracker is almost the same as on the noise-free sequence.


Long-term sequence tracking. Long-term sequence tracking has recently drawn many researchers' attention [15] due to its challenges and practical applications. We test the proposed method on one long sequence, doll [24], which was taken with a hand-held camcorder and lasts 3871 frames. Some samples of the tracking results are shown in Fig. 6. They show the tracking capability of our method under scale changes, pose changes, and occlusions.

Other cases. Fig. 7(a) and Fig. 7(b) show more tracking results on the sequences face and dog, respectively. In face, the target frequently undergoes long-term partial occlusion. Our tracker again outperforms all the other trackers. The successful performance can be attributed to the adopted
Fig. 4. Pedestrian tracking results of different algorithms. Panels: (a) crossing, (b) couple, (c) jogging, (d) subway, (e) woman. Legend: GKT, MIL, MS, CPF, COV, IVT, OAB, VTD, Frag, ICTL.

Fig. 5. Vehicle tracking results. Panels: (a) race, (b) car, (c) turnpike, (d) noise. Legend is the same as in Fig. 4.


part-based representation. COV performs poorly on this sequence. In sequence dog, the dog is running and undergoing large pose variation. Although our tracker cannot estimate the scale of the target accurately due to the severe pose change, it follows the dog throughout the sequence.


5.3  Qualitative  analysis  of  ICTL 

We use the sequence crossing to test the effectiveness of the proposed ICTL. Three trackers are used for the qualitative analysis: Tracker-A uses the proposed ICTL approach with the default parameter setting; Tracker-B uses the sample covariance for model update, namely, the parameter w in Eq. 3 is set to 1; and Tracker-C is a tracker without model update. To test whether adding more features could improve the tracking performance, we construct Tracker-D by augmenting Tracker-C with two additional features, namely the two directional second-order intensity derivatives, so that the covariance descriptor for Tracker-D is 9 × 9. The results are illustrated in Fig. 8. As can be seen in the figure, when the target window includes more background clutter (white pixels), Tracker-C drifts and loses the target after #77. Tracker-B drifts from #76 and loses the target in #79. Even with more visual features, Tracker-D cannot track the target robustly. In contrast, our proposed Tracker-A tracks the target throughout the sequence. The success of ICTL can be attributed to its weighting scheme.

Fig. 6. Tracking results on a long sequence, doll. There are pose changes, scale changes, and occlusions in the sequence.

Fig. 7. Tracking results of different algorithms on (a) face and (b) dog. Legend is the same as in Fig. 4.

To further examine how the tracking performance depends on the weight selection, we ran the tracker with different weights on sequence crossing, with the weight ranging from 0 to 1 in steps of 0.05. The results are illustrated in Fig. 9. We can see that an improper weight may degrade the performance. Weights in the range [0.8, 0.95] may be a good choice for the tracker.


Fig. 8. The effectiveness test of ICTL using three modifications of ICTL: Tracker-A (white), Tracker-B (red), Tracker-C (blue) and Tracker-D (yellow).


5.4  Quantitative  Evaluation 

To  quantitatively  evaluate  all  the  trackers,  we  manually  labeled 
the  bounding  box  of  the  target  in  each  frame.  In  Table  1 
we  give  the  average  tracking  errors  of  each  approach  in 
all  sequences.  From  the  statistical  results,  we  can  see  that 
although  many  of  the  state-of-the-art  tracking  approaches  have 


Fig. 9. Tracking performance w.r.t. the weight selection (tracking performance using different sample weights on sequence crossing; horizontal axis: weight w).


difficulty  tracking  the  targets  throughout  the  sequence,  our 
proposed  tracker  can  track  the  target  robustly. 

To measure the tracking quality of each approach, we use the area coverage between the tracking result and the annotation as the criterion. The range of this measure is [0, 1]. The average quality is shown in Table 2. If we treat a coverage lower than 1/3 as a poor tracking result, we obtain the failure statistics shown in Table 3. None of the approaches performs well on the dog sequence, because the target undergoes large deformation together with scale change; car and race are also challenging sequences due to the large scale variation. On jogging and woman in particular, our tracker performs much better than the other trackers.
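
The two evaluation criteria can be sketched as follows. The text does not spell out the exact formulas, so this sketch makes two labeled assumptions: "area coverage" is taken to be intersection over union, and the "size of the target" used for normalization is taken to be the diagonal of the annotated box. Boxes are (x, y, w, h) tuples with (x, y) the top-left corner, and all function names are hypothetical.

```python
import math


def center_error(pred, truth):
    """Center distance divided by the annotated box diagonal (assumed)."""
    px, py = pred[0] + pred[2] / 2.0, pred[1] + pred[3] / 2.0
    tx, ty = truth[0] + truth[2] / 2.0, truth[1] + truth[3] / 2.0
    return math.hypot(px - tx, py - ty) / math.hypot(truth[2], truth[3])


def area_coverage(pred, truth):
    """Intersection over union of two boxes, in [0, 1] (assumed measure)."""
    iw = min(pred[0] + pred[2], truth[0] + truth[2]) - max(pred[0], truth[0])
    ih = min(pred[1] + pred[3], truth[1] + truth[3]) - max(pred[1], truth[1])
    inter = max(0.0, iw) * max(0.0, ih)
    union = pred[2] * pred[3] + truth[2] * truth[3] - inter
    return inter / union if union > 0 else 0.0


def count_failures(preds, truths, threshold=1.0 / 3.0):
    """Number of frames whose coverage is below the threshold."""
    return sum(1 for p, t in zip(preds, truths) if area_coverage(p, t) < threshold)


if __name__ == "__main__":
    truth = (0.0, 0.0, 10.0, 10.0)
    for pred in [(0.0, 0.0, 10.0, 10.0), (6.0, 0.0, 10.0, 10.0)]:
        print(center_error(pred, truth), area_coverage(pred, truth))
```

Under these assumptions, a box shifted by 6 pixels against a 10 × 10 annotation has a coverage of 0.25 and would be counted as a failed frame at the 1/3 threshold.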

Fig.  10  illustrates  the  tracking  error  plot  for  each  algorithm 
on  each  testing  sequence.  Each  subfigure  corresponds  to  one 
testing  sequence,  and  in  each  subfigure,  different  colored  lines 
represent  different  trackers.  Our  proposed  tracker  performs 
excellently  in  comparison  with  other  state-of-the-art  trackers. 

The reason that our ICTL tracker performs well is threefold: 1) multiple covariance feature matrices are used to characterize the object appearance; 2) the particle filter is adopted for posterior distribution propagation over time; and 3) the ICTL learns the compact covariance model to handle appearance variation.
Fig. 10. The tracking error plot. Legend is the same as in Fig. 4.


5.5  Discussion 

Our proposed tracker is based on the multi-mode representation, the covariance descriptor, incremental appearance learning, and the particle filter. The robustness of the tracking performance is the joint result of these components. In particular, the multi-mode representation addresses partial occlusion and scale estimation; the covariance matrix brings rich information for target representation; and the particle filter is more powerful than search-based approaches. That said, there are challenging cases in which our tracker runs into problems, such as severe motion blur, large and fast scale change, abrupt motion, or the target moving out of the frame. These challenges are likely to occur especially in long sequences. Fig. 11 shows some failed or inaccurate tracking results of the proposed tracker.

Some of the compared trackers, such as the CPF tracker, have no model update procedure. As a result, they cannot handle the appearance variations of the target. Their tracking performance could be improved by adopting a more advanced model update scheme, such as the approach adopted in [26] for the CPF tracker. This may also provide good motivation for choosing the covariance descriptor. We will investigate this in our future work.

Comparing different trackers fairly is not easy. Different evaluation criteria may yield different results. For example, on the sequence subway the center error criterion is not consistent with the area coverage criterion. The center error criterion is widely used in visual tracking, while the area coverage criterion is commonly used in object detection. The area coverage conveys much more information than the center error, e.g., the quality of tracking. Therefore, we think the area coverage is a better criterion for measuring tracking performance.

6  Conclusion 

In this paper, we presented a real-time probabilistic visual tracking approach with incremental covariance model updating. In the proposed method, the covariance matrix of image features represents the object appearance. Further, an incremental covariance tensor learning (ICTL) algorithm adapts to and reflects the appearance changes of an object due to intrinsic and extrinsic variations. Moreover, our probabilistic ICTL method uses a particle filter for motion parameter estimation and the covariance region descriptor for object appearance, and with the use of integral images it achieves real-time performance. The use of a part-based representation of the object model, in addition to the ICTL and Bayesian PF updates, also affords tracking through scale, pose, and illumination changes. Compared with many state-of-the-art trackers, the proposed algorithm is faster and more robust to occlusions and object pose variations. Experimental results demonstrate that the proposed method is promising for robust real-time tracking in many security, surveillance, and monitoring applications.

The proposed probabilistic tracker is well suited to multi-target tracking. Because integral images are used for fast calculation of the covariance matrix, the computational cost grows sublinearly with the number of tracked targets. When a covariance-based object detector [40] is used to initialize the targets, the computational cost would be lower than that of running an independent detector and tracker, because the detector shares the same base features (integral images) with the tracker. Furthermore, the boosted particle filter [27] can be used to improve the multi-object tracking performance.


Appendix  A 
Proof  of  all  lemmas 

Proof  of  Lemma  1: 


$$w_T = \sum_{t=1}^{T} \sum_{i=1}^{N_t} w^{T-t} = \sum_{t=1}^{T} N_t w^{T-t} = \sum_{t=1}^{T-1} N_t w^{T-t} + N_T.$$

Since $w_{T-1} = \sum_{t=1}^{T-1} N_t w^{T-1-t}$, we have $w_T = w\,w_{T-1} + N_T$. Thus


T  Nt 


T  Nt 


_  y^  y^  WT,t,i 2  _  1  y^  y^ 


„2  (T-t) 


t=  1  i=  1 
T 


T  t=  1  i=  1 


T—l 


-L  5>t«,2(r-*)  =  NY  Ntw^+NT)  . 


T 


t=  1 


T 


t=l 


Since  w\_x  =  ^-^(^=1  Ntw2{T  1  +  NT-i)  ,  and 

YlJ=i  Ntw2^T~^  =  (wt-i™t-i  ~  NT-i)w2  ,  we  have 


2  _  {^t-iwt_i  ~  Nt-i)w2  +  Nt 
1  (wwt— 1  +  Nt)2 


□ 
Fig. 11. Some failed or inaccurate tracking results by the proposed tracker.


Proof  of  Lemma  2: 

T  Nt  T  Nt  T  Nt 

EE  WT,t,i(ft,i—  At)  =  EE  EE 